Speech recognition for tonal languages





(57) The present invention relates to a method and a device for speech-to-text conversion. The fundamental tone is extracted from a given speech, and a model of the speech is created from the speech. From the model, a duration reproduction in words and sentences is obtained. The duration reproduction is compared with a segment duration in the speech. The comparison yields information that determines which type of accent exists, whereupon a text with sentence accent information is produced.
Description

TECHNICAL FIELD 

The present invention relates to speech-to-text conversion. In particular, it concerns the possibility of analysing a given speech and obtaining from it information about different accents, as well as stressed and unstressed syllables, in the speech. This information is of importance for the interpretation of the given speech.

PRIOR ART 

In the speech recognition systems in use at present, which employ for instance HMM, fundamental tone and duration information are regarded as disturbances. Information regarding sentence accent types and stressed or unstressed syllables has, in the known applications, been derived on the basis of statistical methods. It has therefore not been possible to identify the information conveyed by the accentuation in the speech.

Patent document US 5220639 describes speech recognition for Mandarin Chinese characters. A sequence of single syllables is recognized by separately recognizing syllables and Mandarin tones and then combining the recognized parts into the single syllable, using hidden Markov models. The recognized single syllable is used by a Markov Chinese language model in a linguistic decoder section to determine the corresponding Chinese character. A tone pitch frequency detector is utilized. The detector detects characteristics of the pitch frequency of the unknown signal and transmits them to a personal computer, included for the tone recognition, in which Markov model probabilities for the five different tones are calculated.

Patent document US 4852170 describes language translation using speech recognition and synthesis. Each speech segment is logically analysed to identify its phoneme class. The frequency spectrum of the segment is then analysed to identify specific phonemes within that class.

Patent document US 4489433 describes speech information transmission by means of telex equipment. After the transmission, speech data can be converted into a readable message of characters. The technology according to the document is principally intended for the Japanese language. The accent type of Japanese words is a tone pitch accent and can be identified from the position of the point in time between the syllables at which the tone pitch frequency changes abruptly to a low frequency. The word accent code indicates a sudden change in tone pitch and fundamental tone frequency, usually caused by the accent of a particular syllable in a word.
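
The pitch-drop criterion summarized above can be illustrated with a minimal sketch. It is not taken from US 4489433: the function name `find_pitch_drop`, the per-syllable mean pitch input and the `drop_ratio` threshold are all assumptions chosen only to show the idea of locating an abrupt fall in tone pitch between syllables.

```python
from typing import List, Optional


def find_pitch_drop(syllable_f0_hz: List[float], drop_ratio: float = 0.8) -> Optional[int]:
    """Return the index i of the first syllable after which the pitch falls abruptly.

    A fall is reported between syllable i and syllable i + 1 when the later
    syllable's mean pitch is below drop_ratio times the earlier one.
    The threshold is an illustrative assumption, not a value from the source.
    Returns None when no such fall is found.
    """
    for i in range(len(syllable_f0_hz) - 1):
        if syllable_f0_hz[i + 1] < drop_ratio * syllable_f0_hz[i]:
            return i
    return None
```

With per-syllable mean pitches of 200, 210, 150 and 140 Hz, the sketch reports the fall after the second syllable (index 1).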

Patent document US 4178472 deals with a speech 
instruction identification system which suggests com- 
ble sounds. The fundamental tone frequency is used as
a symbolic value for speech/sound. 

Patent document EP 180047 relates to recognition of spoken text and subsequent printing. For each segment of the recognized spoken text a corresponding string of characters is stored. Lexical information is utilized.

DESCRIPTION OF THE INVENTION 

TECHNICAL PROBLEM 



In speech recognition there is a need to identify different sentence accents and stressed and unstressed syllables in words and sentences. Methods or devices for generally determining different types of accent and stressed/unstressed syllables have so far been lacking. The prosodic information has so far not been used in speech recognition but is regarded as a disturbance in the statistical methods which are used. The prosodic information is necessary in advanced speech understanding systems and in speech-to-speech translation. By analysing the prosodic information and determining the location and the types of the accents in words and sentences, an increased understanding of the given speech is obtained, together with a possibility to translate it better between different languages. A further problem is to determine stressed/unstressed syllables in words and sentences. The ability to identify the location of stressed and unstressed syllables in words and sentences also gives an increased possibility to identify the real meaning of a sentence. Consequently there exists a need to identify said parameters and utilize them in connection with speech recognition.

The aim of the present invention is to provide a method and a device for identification of the proper sense of a given speech.
THE SOLUTION



The present invention relates to a method for speech-to-text conversion in which the fundamental tone is extracted from a speech. A model of the speech is created from the speech. From the model, a duration reproduction in words and sentences is obtained. The duration reproduction is compared with a segment duration in the speech. From the comparison it is decided which type of accent exists, and a text with sentence accent information is produced. Sentence accents of type 1 and 2 are discernible. Further, stressed and unstressed syllables are discernible. From the model, a model of the fundamental tone in words and sentences is produced. The invention further provides that the fundamental tone is compared with the modelled fundamental tone, whereby an indication of possible accents is obtained. The possible accents from the comparison of the fundamental tone and from the comparison of duration are compared, and a decision is made as to which type of accent or stressed/unstressed syllable exists. The decision is utilized to adjust the model. A text is thereby produced which with great probability has a meaning corresponding to that of the speech. Lexical information is utilized in the creation of the model. The lexical information indicates alternative accents in the words. The lexical information further indicates alternative durations for different segments in the words which are recognized. Syntax analysis of the model is utilized in modelling the fundamental tone in the sentence. The syntax analysis of the model is also utilized in modelling the sentences.
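
The duration comparison described above can be illustrated with a minimal sketch. The description does not state how the comparison is computed, so the ratio test, the `stress_ratio` threshold and the name `compare_durations` are assumptions made here for illustration only; in particular, treating a segment that is clearly longer than modelled as a stress candidate is a simplification introduced for this example.

```python
from typing import List, Tuple


def compare_durations(
    modelled_ms: List[float],
    observed_ms: List[float],
    stress_ratio: float = 1.2,
) -> List[Tuple[float, bool]]:
    """Compare modelled segment durations with durations measured in the speech.

    Returns, per segment, the observed/modelled ratio and a flag that is True
    when the segment is at least stress_ratio times longer than modelled.
    The threshold is an illustrative assumption, not a value from the source.
    """
    if len(modelled_ms) != len(observed_ms):
        raise ValueError("modelled and observed segment lists must have equal length")
    results = []
    for modelled, observed in zip(modelled_ms, observed_ms):
        if modelled <= 0:
            raise ValueError("modelled durations must be positive")
        ratio = observed / modelled
        results.append((ratio, ratio >= stress_ratio))
    return results
```

For modelled durations of 100, 80 and 120 ms and observed durations of 130, 80 and 100 ms, only the first segment is flagged as a candidate.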

The invention further relates to a device for speech-to-text conversion. A fundamental tone is extracted from a speech in a speech recognition equipment. A model of the speech is created in the speech recognition equipment. From the model, a duration reproduction in words and sentences is created. The device is further arranged to compare the duration reproduction with a segment duration in the speech. A decision regarding the type of accent is made in the device on the basis of the comparison. A text with sentence accent information is produced. Sentence accents of type 1 and 2 are discernible, as well as stressed and unstressed syllables. From the model, a model of the fundamental tone in words and sentences is produced. The extracted fundamental tone is compared with the modelled fundamental tone, and an indication of possible locations of accents is obtained. The possible accents from the fundamental tone comparison are compared, and a decision is made regarding which type of accent or stressed/unstressed syllable exists. The decision is utilized for correction of the model, and a text is produced which with great probability corresponds to the meaning of the speech. Lexical information is utilized in the creation of the model. The lexical information includes information about different types of accents and stressed/unstressed syllables etc. in different words and sentences. By means of the lexical information, alternative accents and accent locations are obtained for the words which have been obtained from the lexical information. Alternative durations for different segments in the recognized words are obtained from the lexical information. In modelling the fundamental tone in sentences, a syntax analysis of the model is utilized. In modelling the sentences, the syntax analysis of the model is also utilized.
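
The role of the lexical information can be pictured as a lookup table that returns, for each recognized word, its alternative accents and alternative segment durations. The sketch below is only one possible representation. The names `LexiconEntry`, `HYPOTHETICAL_LEXICON` and `lookup_candidates` are hypothetical, and the duration values are invented placeholders rather than measured data. The Swedish word "anden" is used as the entry because it is commonly cited as a word whose meaning depends on accent 1 versus accent 2.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class LexiconEntry:
    # Alternative accents the word may carry.
    accents: Tuple[str, ...]
    # One tuple of per-segment durations (ms) for each alternative accent.
    durations_ms: Tuple[Tuple[float, ...], ...]


# Placeholder content: the durations are invented for illustration only.
HYPOTHETICAL_LEXICON: Dict[str, LexiconEntry] = {
    "anden": LexiconEntry(
        accents=("accent 1", "accent 2"),
        durations_ms=((110.0, 90.0), (140.0, 120.0)),
    ),
}


def lookup_candidates(
    candidates: Iterable[str], lexicon: Dict[str, LexiconEntry]
) -> Dict[str, LexiconEntry]:
    """Keep only candidate words that exist in the lexicon, with their alternatives."""
    return {word: lexicon[word] for word in candidates if word in lexicon}
```

Candidate strings that are not words of the language simply drop out of the result, which mirrors the first check against the lexicon.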

ADVANTAGES 

The invention allows prosodic information to be utilized in speech analysis, whereby an increased understanding of the speech is obtained. The increased understanding increases the possibility of utilizing spoken information in different contexts, for instance translation from one spoken language into another in automatic speech translation. The invention further allows an increased possibility of utilizing spoken information in different contexts for control of different services in a telecommunications network, for control of different devices, computers, etc.

DESCRIPTION OF FIGURES 

Figure 1 shows the invention in the form of a block 
diagram. 

DETAILED EMBODIMENT 

In the following, the invention is described with reference to the figure and the terms therein.

A produced speech is fed into a speech recognition equipment, 1. In the speech recognition equipment the speech is analysed into its components. Different recognized sequences thereby appear, which are assembled into words and sentences. The analysis in the speech recognition equipment is performed with technology well known to the professional in the field. Consequently, for instance Hidden Markov Models, HMM, can be utilized. In this type of analysis the fundamental tone and the duration information are regarded as disturbances. Information regarding the duration of the segments can, however, be derived from the Markov model. The analysis in the speech recognition equipment yields a number of recognized sounds which are put together into words and sentences. One consequently obtains a set of combinations of syllables which can be combined into different words. These comprise words which exist in the language as well as words which do not exist in the language. In a first check of the recognized words, possible combinations are transferred to a lexicon, 2. The lexicon consists of a normal lexicon with pronunciation and stress information. In the lexicon, the different possible words which can be created from the recognized speech segments are checked. From the lexicon, information about the possible words which can exist based on the recognized speech is fed back. In the speech recognition equipment the words are then compiled into clauses and sentences. This information is transferred to a syntax analysis, 3. The syntax analysis checks whether the suggested clauses and sentences which have occurred are, from a linguistic point of view, acceptable or not acceptable in the language. The lexical and syntactical information is then transferred to a fundamental tone modulating unit, 5, and a duration modulating unit, 6. In the fundamental tone modulating unit the fundamental tone is modulated on the basis of the lexical and syntactical information. A fundamental tone modulation in words and sentences is thereby obtained. The obtained information is transferred to a comparator, 7, which also receives information regarding the fundamental tone of the speech, which has been extracted in the fundamental tone extractor, 4. From the comparison in 7, information about possible locations of the sentence accent, accent 1 and accent 2 is obtained.
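
As a rough illustration of what the comparator, 7, might do, the sketch below ranks accent hypotheses by how closely each modelled fundamental tone contour matches the extracted one. The description does not specify a distance measure, so the root-mean-square distance, the equal-length contours and the names `contour_distance` and `rank_accent_hypotheses` are assumptions introduced only for this example.

```python
import math
from typing import Dict, List, Tuple


def contour_distance(extracted_hz: List[float], modelled_hz: List[float]) -> float:
    """Root-mean-square difference between two equally sampled pitch contours."""
    if not extracted_hz or len(extracted_hz) != len(modelled_hz):
        raise ValueError("contours must be non-empty and of equal length")
    squared = sum((e - m) ** 2 for e, m in zip(extracted_hz, modelled_hz))
    return math.sqrt(squared / len(extracted_hz))


def rank_accent_hypotheses(
    extracted_hz: List[float],
    modelled_by_accent: Dict[str, List[float]],
) -> List[Tuple[str, float]]:
    """Return (accent label, distance) pairs, closest modelled contour first."""
    scored = [
        (contour_distance(extracted_hz, contour), label)
        for label, contour in modelled_by_accent.items()
    ]
    return [(label, distance) for distance, label in sorted(scored)]
```

The ranked list corresponds to the "possible locations" handed on for the later decision; a real system would also need to align contours of different lengths, which this sketch leaves out.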
From the lexical and syntactical analysis a model of the duration in words and sentences is also produced. For this purpose the lexicon contains information about the duration of different syllables in the possible words which have been obtained in the analysis of the speech. In the syntax analysis, possible durations are also produced for different parts of the possible sentences and of the different words. From the total information a segment duration is obtained, in which the duration of the vowels and of any following consonants is the most important. The information obtained in this way is transferred to a second comparator, 8. The comparator, 8, also receives information on the segment duration in the real speech from the speech recognition equipment. From the comparison in the comparator, 8, information about possible locations of accent 1, accent 2, stressed or unstressed syllables and sentence accents is obtained. This information is transferred to a decision-maker, 9, which has also received information from the first comparator, 7, regarding sentence accent, accent 1 and accent 2 from the fundamental tone information. The decision-maker then compiles the information from the two comparators and decides whether accent 1, accent 2, stressed or unstressed syllable, or sentence accent exists. The obtained information is then fed back to the speech recognition equipment, which modifies the original model and then outputs a text with sentence accent information.
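
How the decision-maker, 9, compiles the two comparator outputs is not detailed in the description. One simple possibility, shown here purely as an assumption, is a weighted sum of per-hypothesis scores from the fundamental tone comparator and the duration comparator. The function name `decide_accent`, the score dictionaries and the `f0_weight` parameter are hypothetical.

```python
from typing import Dict


def decide_accent(
    f0_scores: Dict[str, float],
    duration_scores: Dict[str, float],
    f0_weight: float = 0.5,
) -> str:
    """Pick the hypothesis with the highest weighted score from both comparators.

    A hypothesis missing from one comparator contributes 0.0 from that side.
    The weighting scheme is an illustrative assumption, not the source's method.
    """
    if not 0.0 <= f0_weight <= 1.0:
        raise ValueError("f0_weight must lie between 0 and 1")
    labels = sorted(set(f0_scores) | set(duration_scores))
    if not labels:
        raise ValueError("at least one hypothesis is required")

    def combined(label: str) -> float:
        return (
            f0_weight * f0_scores.get(label, 0.0)
            + (1.0 - f0_weight) * duration_scores.get(label, 0.0)
        )

    # Sorting the labels first makes tie-breaking deterministic.
    return max(labels, key=combined)
```

In this sketch the duration comparator can propose labels such as "stressed" that the fundamental tone comparator never produces, which reflects the description that only the second comparator reports stressed or unstressed syllables.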

The suggested solution makes it possible to recognize a speech and reproduce it correctly, with better accuracy than in previously known methods. The meaning given in the original speech can thereby be reproduced correctly. Further, the information can be utilized when the given speech is to be translated into another language. It also becomes possible to find the right word and expression and to determine which of several alternative meanings shall be used in the analysis of words and sentences. The uncertainty in previous methods, principally statistical methods, in deciding the proper sense of different words is drastically reduced by the suggested method.

The invention is not restricted to the embodiment presented in the description, or by the patent claims, but can be subject to modifications within the scope of the inventive idea.



Claims 

1. Method for speech-to-text conversion, in which a fundamental tone is extracted from a speech, and a model of the speech is created from the speech, characterized in that from the model a duration reproduction in words and sentences is obtained, that the duration reproduction is compared with a segment duration in the speech, that from the comparison it is decided which type of accent exists, and that a text with sentence accent information is produced.

2. Method according to patent claim 1, characterized in that accent 1, accent 2 and sentence accents are discerned.

3. Method according to patent claim 1, characterized in that stressed and unstressed syllables are discerned.

4. Method according to any of the previous patent claims, characterized in that from the model a model of the fundamental tone in words and sentences is produced.

5. Method according to any of the previous patent claims, characterized in that the extracted fundamental tone is compared with the modelled fundamental tone, whereby an indication of possible accents is obtained.

6. Method according to any of the previous patent claims, characterized in that the possible accents from the fundamental tone comparison and from the duration comparison are compared, and a decision is made as to which type of accent or stressed/unstressed syllable exists.

7. Method according to any of the previous patent claims, characterized in that the decision is utilized for correction of the model, whereby the produced text with great probability has a meaning corresponding to that of the speech.

8. Method according to any of the previous patent claims, characterized in that lexical information is utilized in the creation of the model.

9. Method according to any of the previous patent claims, characterized in that the lexical information indicates alternative accents in the words.

10. Method according to any of the previous patent claims, characterized in that the lexical information indicates alternative durations for different segments in the words which are recognized.

11. Method according to any of the previous patent claims, characterized in that syntax analysis of the model is utilized in modelling the fundamental tone in the sentence.

12. Method according to any of the previous patent claims, characterized in that the syntax analysis of the model is utilized in modelling the sentences.

13. Device for speech-to-text conversion, in which a fundamental tone is extracted from a speech in a speech recognition equipment, and a model of the speech is created in the speech recognition equipment, characterized in that from the model a duration reproduction in words and sentences is created, that the device is further arranged to compare the duration reproduction with a segment duration in the speech, that a decision regarding accent type is made in the device on the basis of the comparison, and that a text with sentence accent information is produced.

14. Device for speech-to-text conversion according to patent claim 13, characterized in that accent 1, accent 2 and sentence accents are discernible.

15. Device for speech-to-text conversion according to patent claim 13, characterized in that stressed and unstressed syllables are discernible.



16. Device for speech-to-text conversion according to patent claims 13 up to and including 15, characterized in that from the model a model of the fundamental tone in words and sentences is produced.

17. Device for speech-to-text conversion according to patent claims 13 up to and including 16, characterized in that the extracted fundamental tone is compared with the modelled fundamental tone and that an indication of possible locations of accents is obtained.

18. Device for speech-to-text conversion according to patent claims 13 up to and including 17, characterized in that the possible accents from the fundamental tone comparison are compared and that a decision is made regarding which type of accent, or stressed/unstressed syllable, exists.

19. Device for speech-to-text conversion according to patent claims 13 up to and including 18, characterized in that the decision is utilized for correction of the model and a text is produced which with great probability corresponds to the sense of the speech.

20. Device for speech-to-text conversion according to patent claims 13 up to and including 19, characterized in that lexical information is utilized in the creation of the model.

21. Device for speech-to-text conversion according to patent claims 13 up to and including 20, characterized in that alternative accents in the words are obtained from the lexical information.

22. Device for speech-to-text conversion according to patent claims 13 up to and including 21, characterized in that alternative durations for different segments in the recognized words are obtained from the lexical information.

23. Device for speech-to-text conversion according to patent claims 13 up to and including 22, characterized in that in modelling the fundamental tone, a syntax analysis of the model is utilized.

24. Device for speech-to-text conversion according to patent claims 13 up to and including 23, characterized in that in modelling the sentences, the syntax analysis of the model is utilized.